Introduction

In this project, I’m interested in exploring data from Prosper, a peer-to-peer lending platform. As this is a huge data set (~114,000 observations with 81 variables), I’ll be limiting myself to a subset of the most interesting variables. Note that the data set was last updated on 3/11/2014.

In particular, I’m most interested in the following variables (for the reasons included):

Prosper Variable | Why It’s Interesting
ListingCreationDate | This is useful for contextualizing the data, especially given the Great Recession of 2008-09
Term | Useful for understanding what time span a borrower felt comfortable with
LoanStatus | This is the primary outcome variable, helping to determine things like Prosper’s average default rate and what types of borrowers tend to default
ClosedDate | Useful for knowing what average payoff (and permanent default) timelines look like
BorrowerAPR | This will provide broad context for the cost of the loan, beyond what is provided in BorrowerRate
BorrowerRate | This gives an idea of how much the borrower emphasized monthly payments vs. total loan cost by comparing to LenderYield
ListingCategory | This will again be helpful for contextualizing the data and letting me know what the loan’s stated purpose was/is.
BorrowerState | This may provide interesting extra data in the geographic sense (e.g. comparing loan defaults in a given state with the amount of home foreclosures in that state for the year in question)
EmploymentStatus | A useful categorical variable for determining likelihood and speed of repayment based on existing employment
CreditScoreRangeLower | Provides economic context for the borrowers
CreditScoreRangeUpper | Provides economic context for the borrowers
CreditScoreMidpt | A variable calculated in this project, using the midpoint of the upper and lower credit scores recorded, for reducing variable count in analyses
DelinquenciesLast7Years | Much like other variables here, useful for understanding likelihood of defaults/full repayments of borrowers with differing amounts of historical delinquency.
StatedMonthlyIncome | Provides economic context for the borrowers
LoanOriginalAmount | Important base information about the nature of each loan
MonthlyLoanPayment | Important base information about the nature of each loan
##      ListingCategory ListingCategory..numeric.
## 1      Not Available                         0
## 2   Home Improvement                         2
## 3      Not Available                         0
## 4         Motorcycle                        16
## 5   Home Improvement                         2
## 6 Debt Consolidation                         1
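
The output above pairs each numeric listing code with a human-readable label. Here is a minimal sketch of how such a mapping could be built with a named lookup vector. The names `category_labels` and `listing_numeric` are my own for illustration, and the lookup only covers the five codes that can be read off this report's output; the remaining codes would need to come from Prosper's data dictionary.

```r
# Hypothetical lookup table: only codes that can be confirmed from this report's output.
category_labels <- c(
  "0"  = "Not Available",
  "1"  = "Debt Consolidation",
  "2"  = "Home Improvement",
  "7"  = "Other",
  "16" = "Motorcycle"
)

# Toy stand-in for the ListingCategory..numeric. column (the six rows printed above).
listing_numeric <- c(0, 2, 0, 16, 2, 1)

# Index the lookup by the code's character form, then store the result as a factor.
listing_category <- factor(unname(category_labels[as.character(listing_numeric)]))

data.frame(ListingCategory = listing_category,
           ListingCategory..numeric. = listing_numeric)
```

Any code missing from the lookup would come back as NA, which makes gaps in the mapping easy to spot.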

Univariate Plots Section

Let’s take a 1D look at each of our variables first to make sure we understand what we’re working with. We’ll start with taking a look at the data set’s structure and size.

## 'data.frame':    113937 obs. of  18 variables:
##  $ ListingCreationDate      : POSIXct, format: "2007-08-26 19:09:29" "2014-02-27 08:28:07" ...
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate               : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield              : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ DelinquenciesLast7Years  : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment       : num  330 319 123 321 564 ...
##  $ ListingCategory          : Factor w/ 21 levels "Auto","Baby&Adoption",..: 14 9 14 13 9 6 6 9 15 15 ...
##  $ CreditScoreMidpt         : num  650 690 490 810 690 ...

We are dealing with a data set that contains 113,937 observations (AKA loans) of 18 variables.

Loans as a Function of Time

This is a strange functional dependence of loan count (per month) over time. It looks like business was booming in the Prosper loan marketplace from its inception in 2006 until 2009, then it did a hard reset. This could be due to a few things:

  1. 2008 was the beginning of The Great Recession in the United States. It’s possible that lenders were skittish about loaning money through new loan mechanisms like peer-to-peer and, likely, they were skittish about making loans in general. Also, borrowers in the Prosper marketplace, much like in many other American markets at the time, probably also avoided debt as much as possible, while trying to save cash and pay off pre-existing debts as fast as possible.

  2. According to Prosper’s Wikipedia page, the marketplace faced a class action lawsuit starting in November 2008 that, among other things, took issue with the way Prosper managed its loans and the geographic variability it allowed for in its lenders. As a result, the pre-2009 model Prosper had used for setting interest rates on loans (an eBay-style Dutch auction system wherein lenders could bid on the interest rates and loans) was replaced with a system wherein Prosper set interest rates itself, using data it had on hand about the borrower and other elements of the lending ecosystem. It is possible that this change in business model was jarring to existing users who then became disenchanted during the adjustment period to the new model.

  3. Along the same lines as the preceding point, there is one more possibility: simple geographic limitation. As part of registering its new model and loans with the SEC in July 2009, Prosper also limited its pool of allowable lenders to 28 states in the US and only allowed borrowers from 47 states. While the Wikipedia article referenced earlier doesn’t specify what Prosper’s pre-existing geographic coverage for lenders and borrowers was, it seems likely that it was broader than the post-2009 coverage, suggesting that Prosper’s new model effectively locked some of its pre-existing customers out of its marketplace, which would of course also account for the significant drop in loans in 2009.

While it is certainly possible that The Great Recession caused this drop in business we’re observing, I think it’s far more likely that the lawsuit and change in business model are mostly to blame. The loan counts from May 2008 to November 2008 show a decline, but it is relatively steady in nature. The sudden zeroing out of business from November 2008 to May 2009 seems too drastic to simply chalk up to a strained market.
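
For reference, here is a minimal sketch of how loan counts per month can be tallied from listing timestamps. The four dates are toy values I made up to stand in for ListingCreationDate, and the UTC time zone is an assumption.

```r
# Toy listing timestamps standing in for ListingCreationDate (not real data).
listing_dates <- as.POSIXct(
  c("2008-05-03 10:00:00", "2008-05-20 12:30:00",
    "2008-11-02 09:15:00", "2009-06-11 16:45:00"),
  tz = "UTC"
)

# Collapse each timestamp to its year-month, then count listings per month.
listing_month <- format(listing_dates, "%Y-%m")
loans_per_month <- table(listing_month)
loans_per_month
```

One thing to keep in mind with this approach: months with no listings simply don't appear in the table rather than showing up as zero, which matters when looking at a gap like the one from November 2008 to May 2009.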

A valid question to ask: is it useful to compare loans made 2006-2008 to the rest of the portfolio in this data set? For now, I’m going to keep those earlier loans, as the Great Recession period likely has interesting elements that echo throughout the later loans, but we’ll need to keep a wary eye on that time period, given the amount of change that Prosper experienced. For example, the auction-style rate-setting likely produced defaults at a similar rate to the later preset rates model, but we should expect a lot more variability in rates pre-2009. So it will likely depend upon the variables we choose to use and may require filtering out earlier loans depending on the impact they have on our analysis.

While it seems fairly likely that we’ve explained the 2008 oddity in ListingCreationDate here, I do still wonder about the late-2012/early-2013 dip in loans. My research thus far has not indicated a good reason why this may have occurred. Some additional investor funding in Prosper as a company occurred during this time, but there are other (smaller and bigger) rounds of investing that occurred outside of that time period, so that doesn’t seem like a sufficient explanation.

Distribution of Loan Term Lengths

##  int [1:113937] 36 36 36 36 36 60 36 36 36 36 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00

Based upon this distribution and the summary data from earlier, it is clear that there are really only 3 terms available to borrowers: 12, 36, and 60 months. As such, we’ll make this variable into an ordered factor, as any further analysis using this variable (e.g. a regression) should not assume Term is a continuous variable when it really isn’t.
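
A minimal sketch of that conversion, using the first ten Term values from the structure output as a toy vector (the variable names are mine):

```r
# Toy stand-in for the Term column (first ten rows from the structure output).
term <- c(36, 36, 36, 36, 36, 60, 36, 36, 36, 36)

# Declare all three known terms so a level unused in this sample (12) is still represented.
term_ordered <- factor(term, levels = c(12, 36, 60), ordered = TRUE)

# Share of loans at each term length.
prop.table(table(term_ordered))
```

Because the factor is ordered, comparisons such as `term_ordered[1] < term_ordered[6]` are meaningful, while the values are no longer treated as points on a continuous scale.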

There we go! This is much more intuitive to view than the original continuous-variable version. We can now see clearly that 36-month term loans comprise about 77% of all loans in our data set, 60-month terms are around 22%, leaving only about 1-2% of all loans as 12-month ones.

Distribution of Loan Status

##  Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

So we have a bunch of different statuses, including 6 Past Due statuses that are given further context by the length of delinquency. Note here that “Chargedoff” is a term usually applied by the lender to indicate that the loan is Defaulted and unlikely to be collected upon. As this is effectively a permanent status, it is perhaps unsurprising that it should have the third highest count of the statuses, as it can never be turned into another status over time. Let’s take a look at this visually.

Given how large the differences are between the dominant statuses (“Chargedoff”, “Completed”, “Current”, and “Defaulted”), let’s transform to a log scale, shall we?
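
To see why a log scale helps, here is a small sketch using the status counts from the table above. The counts span from 5 to 56,576, so on a linear axis the small statuses are nearly invisible.

```r
# Counts copied from the LoanStatus table in this report.
status_counts <- c(
  "Cancelled" = 5, "Chargedoff" = 11992, "Completed" = 38074,
  "Current" = 56576, "Defaulted" = 5018, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# log10 compresses roughly four orders of magnitude into a range of about 0.7 to 4.8.
round(log10(status_counts), 2)
```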

The shape of this distribution is now a little more obvious for low signals, but it still seems like something is missing visually. Let’s try making LoanStatus an ordered factor, with the following ranking (with more positive loan outcomes being ranked more highly and thereby receiving a lower Factor Level Rank):

Factor Level Rank | Factor Level
1 | Completed
2 | FinalPaymentInProgress
3 | Current
4 | Cancelled
5 | Past Due (1-15 days)
6 | Past Due (16-30 days)
7 | Past Due (31-60 days)
8 | Past Due (61-90 days)
9 | Past Due (91-120 days)
10 | Past Due (>120 days)
11 | Defaulted
12 | Chargedoff

“Cancelled” is placed in the rankings where it is due to the variety of possible scenarios in which cancellation could occur. It is not always objectively a good outcome, but is generally viewed as more positive than being past due, defaulted, or considered charged off.

##  [1] "Chargedoff"             "Defaulted"             
##  [3] "Past Due (>120 days)"   "Past Due (91-120 days)"
##  [5] "Past Due (61-90 days)"  "Past Due (31-60 days)" 
##  [7] "Past Due (16-30 days)"  "Past Due (1-15 days)"  
##  [9] "Cancelled"              "Current"               
## [11] "FinalPaymentInProgress" "Completed"
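
A minimal sketch of how an ordering like the one printed above can be imposed. The level vector is copied from that output (worst outcome first, best last); the five-element `loan_status` vector is a toy sample of my own.

```r
# Levels in the order printed above: least favorable first, most favorable last.
status_levels <- c(
  "Chargedoff", "Defaulted",
  "Past Due (>120 days)", "Past Due (91-120 days)",
  "Past Due (61-90 days)", "Past Due (31-60 days)",
  "Past Due (16-30 days)", "Past Due (1-15 days)",
  "Cancelled", "Current", "FinalPaymentInProgress", "Completed"
)

# Toy sample of statuses (not real rows).
loan_status <- c("Completed", "Current", "Completed", "Current", "Chargedoff")

loan_status_ordered <- factor(loan_status, levels = status_levels, ordered = TRUE)

# With an ordered factor, "better than" comparisons work directly.
loan_status_ordered[1] > loan_status_ordered[5]
```

Plots of an ordered factor follow the level order, so the bars line up from worst to best outcome instead of alphabetically.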

OK, now we’ve got these statuses ordered the way that is most useful to us, so let’s see how these distributions (linear and log10) change.

This is helpful. It’s now clearer that the difference in counts in Past Due loans is not really significant, except for a slightly larger number for the 1-15 days bucket and a slightly smaller number for the >120 days bucket, both of which are understandable. For the former, it’s easy to be slightly late on your payments through forgetfulness or a misalignment of your income timeline and loan payment timeline. For the latter, I’d expect that many people, when informed that they are at risk of being put into Default status, get their payments in to avoid being forced to pay off the loan fully in that moment.

Distribution of Closed Date

First, we need to convert closing dates from a factor with string levels to date and time values that R recognizes, like we did with the loan listing dates.

##  POSIXct[1:113937], format: "2009-08-14" NA "2009-12-17" NA NA NA NA NA NA NA NA ...
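
Here is a sketch of one way to do that conversion. The toy factor mimics the ClosedDate column, with an empty string for loans that haven't closed; the format string and the UTC time zone are assumptions on my part.

```r
# Toy stand-in for ClosedDate: a factor whose levels are date strings or "".
closed_raw <- factor(c("2009-08-14 00:00:00", "", "2009-12-17 00:00:00", ""))

# Convert via character so the level labels are parsed, not the integer codes.
# Strings that don't match the format (here, "") become NA.
closed_date <- as.POSIXct(as.character(closed_raw),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
closed_date

# Fraction of loans that actually have a closing date.
mean(!is.na(closed_date))
```

Going through `as.character()` first is the important step; converting a factor directly to a number would return its internal level codes instead of the dates.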

OK, time to plot this up!

At first glance, the closing dates distribution doesn’t make a lot of sense. Why don’t we see a sudden dropoff in 2008-2009 like we did with the listing creation dates? And why are the numbers for this distribution so much lower? The key here is that only a subset (and clearly a limited one at that - 48.4% here) of the entire loan portfolio would have a closing date (only those with a status of Cancelled, Completed, Chargedoff, or Defaulted in fact). While these are a large portion of the portfolio, they are certainly not the entirety of it, explaining the significantly lower counts. And we can see a dip in the close dates around 12/2011. Given that we know more than 75% of the loans have 3 year terms, this dropoff shouldn’t surprise us: this is almost exactly 3 years after the 2008-2009 dropoff in listing creations.
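
The 48.4% figure can be checked directly from the LoanStatus counts reported earlier. A short sketch of that arithmetic:

```r
# Counts of the four statuses that carry a closing date (from the LoanStatus table).
closed_counts <- c(Cancelled = 5, Completed = 38074,
                   Chargedoff = 11992, Defaulted = 5018)
total_loans <- 113937

# Share of the portfolio with a closing date, as a percentage.
closed_share <- sum(closed_counts) / total_loans
round(100 * closed_share, 1)
```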

Distribution of Borrower APRs and Rates

As the APR (Annual Percentage Rate) and the base interest rate of loans are considered linked (with the APR simply taking into account origination and maintenance costs of the loan that the base rate does not), I decided to explore these together.

For the APRs, it seems like we have a broad peak at 20.05% (defined by the median of the APRs below 35%) and a narrow jump at 35.7%. Yikes, those are some high rates!

For the base rates, we have a broad peak at 14.99% (defined by the median of the Rates below 22.5%), another broad peak at 26.24% (this time defined by the median of the Rates between 22.5% and 31%), and a narrow jump at 32%. As we had guessed, these values are lower than the APR values, as they reflect only the bare interest rates of the loans, not the total cost of origination and maintenance too.

What’s interesting is the functional form however - why are there so many peaks and some high-rate delta functions? Perhaps the answer lies in the time series data. Let’s take a look at these same data, but from pre- and post-2009.

How strange! I had expected that some of the features would disappear in one time period or the other, but both exhibit this low-rate broad peak and high-rate spikes. I’m not sure what would be causing that, but it is interesting to note that the overall trends of both rate types seem to be smoother in 2009 and later, perhaps as a result of the new rate-setting mechanism Prosper employed. However, that time period also has more data points, so the smoothing could just be a reflection of the size of the sample (although with 29,056 loans/25.5% of the total data set we’re working with listed prior to 1/1/2009, it seems like the sample size would be more than sufficient).

Lender Yield

Let’s now explore how well lenders did in allowing their money to be borrowed out through the Prosper platform.

## [1] "Structure and Summary Statistics of LenderYield"
##  num [1:113937] 0.138 0.082 0.24 0.0874 0.1985 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

These look…extremely similar. In fact, a look at the data catalog makes us realize that the lender yield is a calculation that simply subtracts the servicing fee from the interest rate of a given loan. We would thus expect a perfect correlation between these variables. We’ll make sure to test that in the section on Bivariate Plots/Analysis.
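
As a quick preview of that check, here is a sketch using only the first five BorrowerRate and LenderYield values printed in the structure output (rounded as displayed there). It computes the implied servicing fee per loan and the correlation on this tiny sample; it is not a substitute for the full bivariate test.

```r
# First five values of each variable, as displayed in the structure output.
borrower_rate <- c(0.158, 0.092, 0.275, 0.0974, 0.2085)
lender_yield  <- c(0.138, 0.082, 0.240, 0.0874, 0.1985)

# Implied servicing fee: the gap between what the borrower pays and the lender earns.
implied_fee <- borrower_rate - lender_yield
implied_fee

# Pearson correlation on this five-loan sample.
cor(borrower_rate, lender_yield)
```

If the servicing fee were identical for every loan, the correlation would be exactly 1; any loan-to-loan variation in the fee pulls it slightly below that.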

Listing Category

As we’ve done previously, here we’ll inspect the structure of the data, run some summary statistics, and generate a basic plot to make sure we know what we’re looking at. As we’ve already gone through the trouble of providing human-readable values for this variable, we won’t bother investigating the numeric version of this variable.

##  Factor w/ 21 levels "Auto","Baby&Adoption",..: 14 9 14 13 9 6 6 9 15 15 ...
##               Auto      Baby&Adoption               Boat 
##               2572                199                 85 
##           Business Cosmetic Procedure Debt Consolidation 
##               7189                 91              58308 
##    Engagement Ring        Green Loans   Home Improvement 
##                217                 59               7433 
## Household Expenses    Large Purchases     Medical/Dental 
##               1996                876               1522 
##         Motorcycle      Not Available              Other 
##                304              16965              10494 
##      Personal Loan                 RV        Student Use 
##               2395                 52                756 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

## [1] "What is the ranking of these categories, from least count to greatest?"
##  [1] "RV"                 "Green Loans"        "Boat"              
##  [4] "Cosmetic Procedure" "Baby&Adoption"      "Engagement Ring"   
##  [7] "Motorcycle"         "Student Use"        "Vacation"          
## [10] "Wedding Loans"      "Large Purchases"    "Taxes"             
## [13] "Medical/Dental"     "Household Expenses" "Personal Loan"     
## [16] "Auto"               "Business"           "Home Improvement"  
## [19] "Other"              "Not Available"      "Debt Consolidation"

We can see in the linear and log-transformed plots here that debt consolidation is by far the most common use for these loans: its count would still be the largest even if all of the Not Available and Other loans were added to the next-largest explicit category. Speaking of which, Not Available and Other have the second- and third-highest counts, respectively, which really isn’t that surprising since they are catch-all categories.

When only considering explicit categories (in other words, not Not Available or Other), the top 5 in descending order are:

  1. Debt Consolidation
  2. Home Improvement
  3. Business
  4. Auto
  5. Personal Loan
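
This ranking can be reproduced from the category counts reported above. A minimal sketch, with the counts copied from that table and variable names of my own choosing:

```r
# Counts copied from the ListingCategory table in this report.
category_counts <- c(
  "Auto" = 2572, "Baby&Adoption" = 199, "Boat" = 85, "Business" = 7189,
  "Cosmetic Procedure" = 91, "Debt Consolidation" = 58308,
  "Engagement Ring" = 217, "Green Loans" = 59, "Home Improvement" = 7433,
  "Household Expenses" = 1996, "Large Purchases" = 876,
  "Medical/Dental" = 1522, "Motorcycle" = 304, "Not Available" = 16965,
  "Other" = 10494, "Personal Loan" = 2395, "RV" = 52, "Student Use" = 756,
  "Taxes" = 885, "Vacation" = 768, "Wedding Loans" = 771
)

# Drop the two catch-all categories, then take the five largest counts.
catch_all <- c("Not Available", "Other")
explicit <- category_counts[!names(category_counts) %in% catch_all]
top_five <- head(sort(explicit, decreasing = TRUE), 5)
names(top_five)
```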

I don’t see anything particularly surprising about this. If anything, the biggest surprise to me out of all of this is that Wedding Loans and Engagement Ring rank higher in frequency than Baby&Adoption. Anyone with kids knows that they are far more expensive than the vast majority of weddings and rings!

Borrower State

Let’s take a look at some geographic diversity aspects of these loans.

##  Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150
## [1] "Loans by state, ordered by increasing frequency"
##  [1] "ND" "ME" "WY" "IA" "SD" "AK" "VT" "DE" "MT" "DC" "WV" "HI" "RI" "NM"
## [15] "NH" "ID" "NE" "MS" "AR" "UT" "LA" "OK" "KY" "KS" "NV" "SC" "CT" "AL"
## [29] "TN" "OR" "WI" "AZ" "IN" "CO" "MA" "MN" "MO" "MD" "PA" "WA" "NC" "NJ"
## [43] "VA" "MI" "OH" "GA" ""   "IL" "FL" "NY" "TX" "CA"

There are clearly some states dominating the borrower landscape here. However, I wonder: is this dominance in borrower count a result of simply population differences between the states, or something else (e.g. CA is where Prosper is headquartered and CA residents always love to jump on the newest tech bandwagon as soon as possible)? Normalizing these data by state population (in other words, number of loans per capita) is something we’ll tackle in the bivariate analysis section of this report. Stay tuned!

Employment Status

Let’s take a look at the structure and distribution of another interesting explanatory variable: borrowers’ employment statuses.

##  Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

It looks like this variable may be a tad tricky, as one would expect that “Full-time”, “Part-time”, and “Self-employed” would all be part of the set “Employed” but they’re treated as separate levels. Likely this means that Prosper started asking for more (or less) detail about employment status as time went on. Let’s plan to investigate this variable over time in the bivariate analysis section.

By far, most borrowers were employed, and (even though Employed is a pretty low-resolution value for this variable) likely those who are just listed as Employed are full-time. Later we’ll take a look at the correlation between employment types and loan status, as it seems plausible that those who are employed full-time will find it easier to make payments on time and close their accounts in good standing.

Credit Scores

Now we’ll explore the credit scores of different borrowers. As these were provided in the original data set as upper and lower ends of ranges, I chose to create a variable in the data frame that was the midpoint value of each range, to simplify analysis without losing resolution. Even though there are a limited number of possible values for this variable, I chose to keep it as a numeric instead of a factor, so regression analysis could be done if desired.

##  num [1:113937] 650 690 490 810 690 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591
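
A minimal sketch of the midpoint calculation, using the first five lower and upper bounds from the structure output plus one missing pair to show how NAs carry through (the vector names are mine):

```r
# First five credit score bounds from the structure output, plus a missing pair.
score_lower <- c(640, 680, 480, 800, 680, NA)
score_upper <- c(659, 699, 499, 819, 699, NA)

# Midpoint of each range; a missing bound yields a missing midpoint.
credit_score_midpt <- (score_lower + score_upper) / 2
credit_score_midpt
```

Since each range is 19 points wide, every midpoint ends in .5, which is why the summary above reports values such as 669.5 and 689.5.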

It’s not entirely obvious why there are values near zero here. Likely that’s an artifact of faulty data (e.g. a missing digit from the credit reporting agency or poor manual data entry). We’ll ignore that unlikely outlier and zoom in on the 400 to 900 range.

In the zoomed-in range, we see an interesting and somewhat unexpected distribution of credit scores. It looks like there’s a small spike in loans for those with credit scores in the 520 - 540 range, but the main peak is at 680 (unsurprising, since our early descriptive summary analysis indicates a median of 689.5). This is not such a strange result, as the average American credit score in 2017 was 700.

Delinquencies

Another interesting potential predictor of loan payments and successful loan completion is borrower historic credit delinquencies. Let’s take a look at that.

##  int [1:113937] 4 0 0 14 0 0 0 0 0 0 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990

Clearly, a majority of borrowers had no delinquencies on their records. This is some seriously long-tailed data. Let’s do a log10 transform to see what’s really going on here.

How odd! Why would there be such an uptick in delinquencies at the value 99? Likely this is a reporting artifact (e.g. credit bureaus only reported 99 for borrowers with 99+ delinquencies). A quick Googling doesn’t shed much light on this, but given that the trend is quite smooth on a log-scale down to 98, I’m willing to assume that this isn’t a meaningful data point. Interestingly, while credit scores are pretty normally distributed in this group, delinquencies are not. This likely indicates that delinquencies have a strongly negative effect on scores, given that the majority of our borrowers have no delinquencies on record. But really, we don’t have sufficient data for that; we’d need to do a correlational analysis on credit scores and delinquencies. Don’t worry - we’ll do just that soon enough!

Borrowers’ Monthly Incomes

Another interesting explanatory variable - monthly income. This is a nice continuous variable that can give us more resolution than EmploymentStatus when it comes to things like loan payment status. Let’s dig into it.

##  num [1:113937] 3083 6125 2083 2875 9583 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

These data clearly have some outliers. Using the standard definition of an outlier as being data with values >= 1.5 * IQR + 3rd_quartile, we find that outliers here are defined as being greater than $12262.50. Let’s look at some plots limiting ourselves to that.
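
The cutoff follows directly from the quartiles in the summary above. A short sketch of the arithmetic, applied to a toy income vector made from the first five displayed values plus the reported maximum:

```r
# Quartiles taken from the summary of StatedMonthlyIncome above.
q1 <- 3200
q3 <- 6825

# Upper fence: third quartile plus 1.5 times the interquartile range.
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence

# Toy incomes: first five displayed values plus the reported maximum.
incomes <- c(3083, 6125, 2083, 2875, 9583, 1750003)

# Keep only the values below the fence.
incomes[incomes < upper_fence]
```

On a full column, `quantile()` and `IQR()` would supply the quartiles instead of hard-coded values.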

Taking this outlier definition results in a loss of about 5% of the records, which seems reasonable for exploratory purposes. Given that only about 15% of US households make more than this, I think it’s a pretty reasonable cutoff for our purposes. There’s not too much that stands out in these data, other than the pretty regular spikes every $400. Perhaps this has something to do with the human tendency to round to the nearest 500 or 1000? That’s not clear.

Also, I should note that the peak of this variable is around 4000, although this is a bit further removed from the median value of 4667 than I was expecting.

Amount Loaned

In order to understand the nature of the loans in our data set, we should also probably know how much each was worth!

##  int [1:113937] 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

There are peaks at $10000, $15000, and $25000 - none of these are particularly surprising, given how nice and round those values are. They were likely just a stab in the dark by the borrower regarding how much they actually needed. The surprising data point, though, is also the largest peak: $4000. What is so common of an item/service that it requires a $4000 loan? Given that the largest loan category was debt consolidation, perhaps $4000 is a popular limit to put on revolving debt (e.g. credit cards)? That doesn’t seem to explain it, as the average credit card limit in 2015 was $8042. How curious.

Monthly Payments

We now come to our final variable of interest, the monthly amount due on the loans. I expect that this will correlate strongly with the loan principal and also likely the monthly borrower income. For now, though, let’s just look at its overall spread in our portfolio of loans.

##  num [1:113937] 330 319 123 321 564 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Oddly enough, this seems multi-modal with peaks at $200, $400, $525, $850, $1150, and $1350. I don’t really know what to make of those peaks, but perhaps our work with more than one variable will illuminate this pattern a bit more (e.g. looking at monthly payment vs. APR).

Univariate Analysis

What is the structure of your dataset?

It is a data set of 113,937 observations (loans) and 18 variables. The variables are (grouped whenever they are related):

  1. ListingCreationDate
  2. Term
  3. ClosedDate
  4. LoanStatus
  5. BorrowerAPR and BorrowerRate
  6. LenderYield
  7. ListingCategory and ListingCategory..numeric.
  8. BorrowerState
  9. EmploymentStatus
  10. CreditScoreRangeLower, CreditScoreRangeUpper, and CreditScoreMidpt
  11. DelinquenciesLast7Years
  12. StatedMonthlyIncome
  13. LoanOriginalAmount
  14. MonthlyLoanPayment

ListingCategory, EmploymentStatus, and BorrowerState are unordered factors; Term and LoanStatus are ordered factors. Their levels are as follows:

  • ListingCategory: 21 types of end-uses for loans, including things like Debt Consolidation, Baby&Adoption, and Engagement Ring.
  • EmploymentStatus: “”, Employed, Full-time, Not available, Not employed, Other, Part-time, Retired, and Self-employed
  • BorrowerState: all 50 US states + Washington, DC, plus an empty level (“”) for the 5,515 loans with no state recorded
  • Term: 12 < 36 < 60 (all in units of months)
  • LoanStatus: ordered as follows (least to greatest)
    • Chargedoff
    • Defaulted
    • Past Due (>120 days)
    • Past Due (91-120 days)
    • Past Due (61-90 days)
    • Past Due (31-60 days)
    • Past Due (16-30 days)
    • Past Due (1-15 days)
    • Cancelled
    • Current
    • FinalPaymentInProgress
    • Completed

What is/are the main feature(s) of interest in your dataset?

The main features I’m most interested in are the loan statuses (trying to understand what characteristics tend to make for a well-paid or delinquent loan) and the listing date for each loan, given the complete drop in loans as of 2009, likely due to the Great Recession and/or Prosper significantly changing its rate-setting model. I suspect that there will be important changes in the explanatory variables as a function of time. Mainly, however, I’m interested in being able to build a predictive model of good and poor borrowers.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The other variables are mostly useful in this investigation as well, particularly CreditScoreMidpt, ListingCategory, StatedMonthlyIncome, DelinquenciesLast7Years, MonthlyLoanPayment, BorrowerRate, and EmploymentStatus. Here are the major outcomes observed thus far:

Did you create any new variables from existing variables in the dataset?

For the credit score data, I created CreditScoreMidpt in order to condense and simplify the data from CreditScoreRangeUpper and CreditScoreRangeLower by taking the midpoint between this pair of variables for each observation.

I also mapped the original numerically coded ListingCategory..numeric. variable to a new variable ListingCategory that transforms the original numeric encoding into human-readable categories.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As stated earlier, there is a curious drop in loan activity in 2009 that I am going to investigate further. Other oddities I noted:

  1. The loans largely have very high interest rates, with a surprisingly large number of loans taking an abnormally high value of APR and Rate, outside of the relatively smooth distribution of rates seen for the rest of the portfolio
  2. Distribution of borrower credit scores is mostly normal and centered around 680, with some outliers near zero and a small peak around 520 - 540.
  3. The majority of borrowers (~67%) have no delinquencies in the last 7 years, but there are some who have 99 or more.
  4. Monthly loan payments trend downward in frequency with increasing payment amount, not a surprising trend. However, they were multi-modal, indicating a variety of payments were particularly popular for some reason, trending from $200/month (the most popular) to $1350/month

Bivariate Plots Section

Let’s start by doing a matrix of 2D plots to compare each variable to the others (I expect this to be VERY dense, but let’s give it a go to see if there’s anything unexpected). At this point, I feel comfortable ignoring LenderYield, ListingCategory..numeric., and the upper and lower credit score range bounds. I’ve also had to remove a few other columns to make this graphic a bit less unwieldy, but we’ll definitely be exploring more than what we see here.

Interesting! I’m not seeing some of the trends I was expecting (e.g. BorrowerRate and StatedMonthlyIncome don’t seem highly correlated), but there are some trends here that are unexpected. For example:

There are a few other variable combinations that I expect intuitively will be interesting to investigate but that I couldn’t fit into the ggpairs analysis, so we’ll explore those individually.

But first things first: we have a to-do item lingering from our univariate analysis that we should address. BorrowerState showed some behavior that we wanted to explore further in this bivariate analysis section, so let’s do that!

Y vs. BorrowerState

Let’s take a look at the relationship BorrowerState has with some of our other variables. As you will see frequently in this Bivariate Analysis section, we will usually take a look at the relationship each variable has with our outcome variable, LoanStatus. However, in order to better understand how the predictor variables interact with one another, we will at times plot those relationships instead. Let’s see what falls out!

First, we’ll reproduce our simple counts of loans over states from earlier.

Borrower State Loans Per Capita

As seen earlier, CA, TX, NY, FL, and IL dominate these results. But what if we account for the populations of these states? What is the Prosper loan per capita number for each state? After all, that’s the more relevant measure here: what state has the greatest percentage of its population interested in Prosper loans? Is it a state like CA that loves new software platforms, or somewhere else?
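A rough sketch of the per-capita calculation in base R follows. The Prosper data set has no population figures, so `state_pop` is a placeholder with made-up numbers; in practice it would come from an external source such as census estimates. The state vector is also just a toy stand-in.

```r
# Toy stand-in for the BorrowerState column (illustrative, not real data)
borrower_state <- c("CA", "CA", "CA", "DC", "DC", "IL")

# Placeholder populations (made-up values; the data set does not include these)
state_pop <- c(CA = 3000, DC = 100, IL = 1000)

# Count loans per state, then divide by the matching population
loan_counts <- table(borrower_state)
loans_per_capita <- as.vector(loan_counts) / state_pop[names(loan_counts)]

# Sort descending so the per-capita leaders come first
print(sort(loans_per_capita, decreasing = TRUE))
```

Indexing `state_pop` by the names of the count table keeps the two vectors aligned even if the population source lists states in a different order.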

Well look at that! Indeed, we see that the leading states are no longer the ones we had originally identified (except in the case of IL - they must really like Prosper!). On a loans per capita measure, DC leads the pack by a substantial margin. It’s followed by GA, MD, IL, CT, and OR. Well this blows any theories I may have had out of the water. I wonder if this is purely a function of the time spent in the Prosper marketplace by these new leading per-capita states (e.g. those identified earlier, such as NY and CA, may not have gotten into the Prosper marketplace as soon as these new leader states, allowing those states to generate more loans when integrated over time)? Or perhaps it has something to do with people in these new leader states on average living more paycheck-to-paycheck and needing loans to make ends meet? This would likely be reflected in their credit score, so we can check that out too.

BorrowerState and ListingDate

Per my first thought in the preceding section, let’s take a look at how the listing creation dates trend as a function of the borrower’s state. In theory, if the surprising loans per capita results are purely an effect of the time states have spent in the marketplace, then I should see that effect somehow in the time series data.

The states we noted as having the highest number of loans per capita are highlighted here using red dashed lines, with the year 2009 being shown with a blue line. From this, it certainly doesn’t look like those states have been in the Prosper marketplace an appreciably longer time than the leaders from the original plot (e.g. CA and NY). Let’s take a closer look at these different states in a more informative way, with some violin plots.

Looking at this, we can see a few interesting items of note among the top states:

  1. None of the per-capita leader states I identified earlier seem to have limited their participation to the pre-2009 timeframe, so they weren’t forced out of the market during Prosper’s redefinition of its services. In fact, the only ones that I see stopping borrowing activity after 2009 (from the larger earlier graph) are IA, ME, and ND. Only 3 of them (GA, IL, and OR) even have more than 25% of their loans from that time period and none have their median (denoted by the dots in the violin plots) near 2009 or earlier.

  2. DC has very limited loan activity prior to 2009, with its engagement not even beginning at a small scale until 2008. This is especially curious given that it led in terms of loans per capita. It seems that it had aggressive activity relative to its population size in a much shorter time period than the others.

Well, given these data, it doesn’t look like my theory makes sense. The loans-per-capita leader states haven’t had a significantly longer exposure to the market than NY or CA, and in fact DC is a relative newcomer even though it leads for loans-per-capita! In that particular case, it’s possible that DC’s urban-only makeup could account for this, but we don’t have the data on hand to explore that possibility.

As mentioned earlier, let’s explore borrower states as a function of average credit score next.

BorrowerState and Credit Score

It stands to reason, given the personal loans nature of the Prosper marketplace, that states with a populace that is highly indebted relative to other states would have an outsized number of loans per capita. To explore this possibility, let’s compare borrower states to the median credit score of borrowers from those states. While credit score is not a direct measure of leverage per capita (which would be better served by overall debt balances of each borrower, for example), we’ll use it as a proxy for indebtedness. The assumption here is that lots of debt accounts carrying balances will have a mostly negative impact on credit score - albeit that’s not guaranteed, but it’s the best we have to work with in our current data set!

Well, that doesn’t explain it either. There must be some other variable I’m not accounting for here that is dictating the loan frenzy per person that we’re seeing in these states. Perhaps the percentage of the population that lives in an urban area correlates positively with the number of loans per person? That would be an interesting research path to take at a later date.

BorrowerState and LoanStatus

Finally, since loan status (especially complete pay-offs vs. defaults) is our outcome variable, no analysis of borrower states would be complete without looking at that dimension too!

Here I’ve generated both just the counts of loans by state, broken down by the loan status, as well as the loans per capita plot we saw earlier with the same breakdown. While the latter plot will change the overall size of each bar (and thus the absolute size of each bar’s portion dedicated to a given status), it will not cause the ratio of different statuses within a single state to change. As such, looking at the loans per capita plot should be sufficient for our purposes.
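To convince myself of that claim about the ratios, here is a small sketch with made-up counts and placeholder populations (none of these numbers come from the data set). Dividing each state’s counts by a single population value rescales the whole bar, so the within-state shares come out the same.

```r
# Toy counts of loans by status (rows) and state (columns); illustrative only
counts <- matrix(c(60, 30, 10,
                   20, 15,  5),
                 nrow = 3,
                 dimnames = list(status = c("Completed", "Chargedoff", "Defaulted"),
                                 state  = c("CT", "OR")))

# Placeholder populations (assumed values, not from the data set)
pop <- c(CT = 1000, OR = 4000)

# Divide each state's column by that state's population
per_capita <- sweep(counts, 2, pop[colnames(counts)], "/")

# Within-state status shares before and after the per-capita scaling
shares_counts     <- prop.table(counts, margin = 2)
shares_per_capita <- prop.table(per_capita, margin = 2)
print(all.equal(shares_counts, shares_per_capita))
```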

Interestingly, while the number of loans and loans per capita overall vary quite a bit across states, the relative proportions of each different loan status seem to be largely maintained. So, for example, the proportion of loans Completed in CT seems to scale well with the number of loans Chargedoff or Defaulted, when compared to another state like OR. The only obvious caveat from this plot is DC: it seems to have an abnormally low amount of Chargedoff and Defaulted loans relative to the number of loans Completed. This is something I’ll note for a deeper analytical dive later, as it can certainly be explored numerically without just relying on our eyeballing of the plot to derive insights, but that deep dive is beyond the exploratory data analysis objective of this current project.

Y vs. Credit Score Midpoint

OK, now that we’ve looked at the Borrower State results a bit, let’s move on to other variables. In this section, I’ll take a look at the trend between credit scores of borrowers and their loan statuses, but I’ll also take a look at how credit scores correlate with the interest rates borrowers have on their loans and their monthly payments. Earlier when looking at a bunch of different relationships using ggpairs, we saw that there was a positive correlation between monthly payments and credit scores. However, traditional wisdom has it that a higher credit score typically results in a lower borrower rate and thus you’d expect a lower monthly payment, so let’s see what’s going on here!

Let’s start with figuring out what’s going on with the credit scores and monthly payments, as that’s got the potential to be interesting.

Monthly Payments and Credit Scores

Wow, that’s really strange. Monthly payment does, largely, seem to increase with increasing credit score! How odd. I will note that a peak in the data can be observed at a credit score of 710 in the current view: it seems fair to say that likely this relationship isn’t completely linear. That being said, the overall trend up to 710 is still counterintuitive. We’ll make a note to revisit this in the Multivariate Analysis portion of this project, as that may provide some more clarity.

Loan Interest Rates and Credit Scores

OK, this is showing the behavior we expected, with a negative functional relationship between credit score and borrower rate. Interestingly, the data seem somewhat nonlinear, with higher credit scores showing less reduction in loan rates for every 10 points of credit score than lower scores. This goes to show that, on an individual basis, it’s definitely worth spending some time to increase a low score as much as you can, but for those with scores of 700+, it may not be worth the bother.

Credit Scores and Loan Status

Presumably, since a higher credit score is meant to indicate that a borrower is very good at managing their debts, we should see low scores having a high rate of defaults/late payments and high scores having a lot of Current or Completed statuses. Let’s find out.

OK, there’s a lot to see here, so let’s break things down.

First of all, here I’ve drawn a horizontal line at a score of 675, as this is around the median score for both the Chargedoff and Defaulted loans. We can see that the median scores for the different statuses stay above 675 and increase as we get to more positive statuses for the most part, keeping in line with my hypothesis about better scores being somewhat predictive of better borrowers. This trend is also observed when looking at the length of the tails below the 675 threshold: better statuses seem to have shorter tails (although this is more true for Current and FinalPayment than for Completed). The only caveat is the Cancelled status, but that status can apply to a wide variety of possible scenarios (e.g. federal or state policies, bankruptcies, new company policies, etc.). As such, let’s ignore that status (and its surprisingly flat distribution) for these purposes.

Another item that draws the eye in these violin plots is the odd shape of the Current and Completed distributions. The spacing between the peaks seen here is 20 score points, which also happens to be the width of each credit score bin (as the original upper and lower credit score range values are always 19 points apart). As such, this natural binning of the scores makes for effectively discrete credit score values. These peaks likely exist in the distributions for all loan statuses, but they are exaggerated for Completed and Current simply because the majority of loans fall into those two statuses.

One other oddity I notice: many statuses cut off below a score of 600. In fact, only Completed, Cancelled, Defaulted, and Chargedoff have data below this score. Why is that? Did the marketplace not track these other statuses until some point in time wherein they imposed a credit score minimum on borrowers or something? Due to this oddity, we’ll next plot the credit scores in our data set vs. the loan listing date to see if there are any obvious temporal changes.

Credit Score vs. Listing Date

Aha! Indeed, it looks like there is a (likely artificial) lower bound on credit scores that begins in early 2007, then increases around mid-2009, and increases once more in late 2013. It’s not clear why these changes occurred, but at least it provides the context to explain what we saw earlier when exploring Loan Status and credit scores. This also provides important context for later interpretations of our data: if the market was self-regulating to exclude lower credit scores as it took on greater numbers of loans, credit-score-related factors (e.g. likelihood of default) are being mitigated as a result and skewing our results as a function of time.
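The score floor seen in the plot can also be checked numerically by taking the lowest score per listing year. This is only a sketch on a made-up `loans` data frame; it assumes ListingCreationDate has already been parsed to a Date.

```r
# Toy stand-in for two columns of the data (illustrative values only)
loans <- data.frame(
  ListingCreationDate = as.Date(c("2006-05-01", "2006-11-15", "2008-03-10",
                                  "2010-07-04", "2010-09-30", "2014-01-20")),
  CreditScoreMidpt    = c(429.5, 649.5, 529.5, 609.5, 709.5, 649.5)
)

# Extract the listing year, then take the lowest score seen in each year
loans$ListingYear <- as.integer(format(loans$ListingCreationDate, "%Y"))
score_floor <- tapply(loans$CreditScoreMidpt, loans$ListingYear, min)

print(score_floor)
```

On the real data, a low quantile (say the 1st percentile) might be a more robust choice than `min`, since a handful of stray low scores would otherwise hide the floor.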

Y vs. APR and/or Borrower Rate

OK, time to explore another dimension of this. We noted earlier with ggpairs that there was negative correlation between Borrower Rate and Monthly Payment, which is just confusing. A higher rate would usually imply a higher monthly payment, but this correlation suggests otherwise.

We’ll take a look in a second to see what this is all about, but first let’s make sure we understand the relationship between APR and Borrower Rate. Based upon my understanding of APR, it should show a similar trend to Borrower Rate, with simply a constant offset on each loan reflecting loan origination fees. As such, I’d expect a fairly high Pearson’s r for this comparison.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and BorrowerAPR
## t = 2347.7, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9897057 0.9899409
## sample estimates:
##      cor 
## 0.989824

This result is exactly as we’d expect. We have a Pearson’s r of 0.9898, indicating a nearly perfect correlation, and the only deviations we see are APRs that go higher than the corresponding Rates, again reflecting the additive nature of loan origination fees. Since other explorations in this report have already been using BorrowerRate, and since it doesn’t fold in other values (such as fees) and is thus simpler than APR, we’ll use that going forward.
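For reference, a test like this can be run with base R’s `cor.test`. The sketch below uses simulated data rather than the Prosper file: the APR is modeled as the rate plus a small, varying fee offset, and the offset range is purely an assumption for illustration.

```r
# Simulated stand-in: APR modeled as the rate plus a small fee offset.
# The rate and offset ranges are assumptions chosen for illustration.
set.seed(42)
BorrowerRate <- runif(500, min = 0.05, max = 0.35)
BorrowerAPR  <- BorrowerRate + runif(500, min = 0.01, max = 0.03)

# Pearson correlation test between the two variables
result <- cor.test(BorrowerRate, BorrowerAPR, method = "pearson")
print(result$estimate)

# APR never falls below the rate when the fee offset is non-negative
print(all(BorrowerAPR >= BorrowerRate))
```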

Monthly Payment as a Function of Rate

OK, now we’re ready to explore this strange negative relationship. Let’s go!

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and MonthlyLoanPayment
## t = -85.202, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2501933 -0.2392759
## sample estimates:
##        cor 
## -0.2447424

Hmmm…this one is definitely not as simple to explain as I had hoped. It looks at first glance like the linear fit is flawed and providing the opposite sign you might expect, but with a little alpha adjustment to deal with overplotting, I can see the negative slope in the data. Perhaps there are a lot of loans with low rates but short terms and thus a higher monthly payment? Or a similar scenario, but with high principal amounts on the loans? I’ll need to explore this a bit more in the multivariate section later. Also of note here is some kind of “stratification effect.” It’s almost as though the plot is actually an amalgam of different positively-sloped lines. I suspect we may find some pattern that explains these lines by layering in one or two other variables in the later multivariate analysis section.
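To see how a negative overall correlation and positively-sloped strata could coexist, here is a hypothetical sketch. It assumes the loans follow the standard fixed-payment amortization formula (I haven’t confirmed that from the data dictionary), and the rates and principals are invented so that higher-rate borrowers take smaller loans. It illustrates one possible mechanism, not a finding from the data.

```r
# Standard fixed-payment amortization formula (assumed to apply here)
monthly_payment <- function(principal, annual_rate, term_months) {
  i <- annual_rate / 12
  principal * i / (1 - (1 + i)^(-term_months))
}

# Hypothetical scenario: borrowers with higher rates take smaller loans
rates      <- c(0.08, 0.12, 0.20, 0.30)
principals <- c(20000, 12000, 6000, 3000)
payments   <- monthly_payment(principals, rates, 36)

# Across loans, rate and payment move in opposite directions
print(cor(rates, payments))

# Within one principal "stratum", payment still rises with the rate
print(monthly_payment(6000, rates, 36))
```

If something like this is happening in the real data, coloring the scatterplot by loan principal or term should separate the strata.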

Rate and Lender Yield

Based upon the data dictionary description of the LenderYield variable, “Lender yield is equal to the interest rate on the loan less the servicing fee.” So, based upon this description, it is highly likely, as in the case of APR, that this variable doesn’t tell us anything new relative to BorrowerRate, and can be ignored for the rest of this analysis. We just need to make sure that this isn’t a poor assumption.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LenderYield
## t = 8493.9, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9992021 0.9992204
## sample estimates:
##       cor 
## 0.9992113

Haha, r = 0.999? Yeah, LenderYield and BorrowerRate are effectively the same variable. OK, good to know!

Rate and Loan Status

Finally, let’s see if Borrower Rate and Loan Status have any obvious relationship.

Based upon the shapes of the distributions and the median rate values for each status, it looks like loans with higher rates tend to have more trouble staying in the “positive statuses”: Current, FinalPaymentInProgress, and Completed. This is unfortunate, as higher rates correlate to lower credit scores and higher rates seem to beget poor loan statuses, which then cause credit scores to fall. This seems like a vicious cycle to me, something to be kept in mind going forward.

Loan Terms vs. Loan Status

Okie dokie, let’s look now at how loan term length might relate to loan status.

Perhaps unsurprisingly due to our earlier univariate analysis, we see here that the majority of loans in any given status has a term of 36 months. From this visualization, it’s not immediately obvious that there’s any trend with the Term variable, so let’s move on.

Listing Category vs. Loan Status

One possibility: different end-uses of loans may dictate likelihood of repayment. While this is likely to largely correlate to the size of the principal on the loan, it’s possible it goes deeper than that (e.g. some reports indicate that Americans often pay off credit cards before making their mortgage payment, if strapped for cash). So let’s see what happens.

Nope, there doesn’t seem to be much of interest here either. We already knew that a large portion of the loans in the data set were oriented towards Debt Consolidation, and the only other obvious difference between positive and negative loan statuses seems to be that Current doesn’t have as large of a portion in the “Not available” category as the others do. This is likely more a function of the listing creation date, with earlier loans likely having less resolution as to what they were going to be used for, so this doesn’t seem to be a notable result.

Employment Status vs. Loan Status

As noted earlier, it will be interesting to see if employment status has an influence on the loan status (e.g. those who are employed full-time may be more likely to have positive loan outcomes than the rest of the borrower pool). Let’s take a look.

Hmmm…this isn’t as obvious of a trend as I’d hoped. We already knew that a majority of borrowers were classified as “Employed”, but that is the overwhelming majority only for current loans. For Chargedoff, Defaulted, and Completed, there is more of a balance between Employed and Full-time, with the latter actually being the majority. It looks like there isn’t much of a correlation here between loan and employment status.

Past Delinquencies vs. Loan Status

OK, let’s try something else. Past delinquencies should theoretically be predictive of loan performance, right? Let’s take a look at that.

Unsurprisingly, we find that the positive loan statuses have significantly fewer delinquencies in their uppermost quartiles than the negative loan status groups. That being said, almost every status has a median delinquency count of 0. The oddball here (as is often the case, we’ve seen) is the Cancelled status. I wish we had more context for those loans! Possibly they were the lowest-performing loans of the portfolio, or maybe the ones that were still in the marketplace prior to the 2009 reboot? Let’s take a look at Loan Status over time, both for listing creation and closing dates.

Listing Creation and Closing Dates vs. Loan Status

Something strange is going on with the Cancelled status. It’s sticking out as a unique loan class in a variety of ways and the only possibility I can think of is that these loans were active during the marketplace reboot in 2009. So let’s see if that falls out of the data.

That did it! OK, this makes more sense now, at least temporally. None of these Cancelled loans existed past 2009 (in fact, they seem limited to just mid-2006 except for a couple of outliers). Weirdly enough, they all seem to have started and ended in the same timeframe, with identical boxplots for both y-variables. This suggests there’s still something artificial about them (perhaps they were just test loans used as part of the software development for the marketplace?), but likely they were very poor-performing loans that Prosper cancelled as part of its reboot for some reason instead of just putting them into default, perhaps to clear out their balance sheets.

Also of note here: many of the loans with positive statuses were listed after the market reboot in 2009 and many of the loans with negative statuses were created before then. This is suggestive that the free-market auction-based system for determining interest rates that Prosper had prior to 2009 wasn’t doing a good job of reducing risks of default for investors. The new pre-determined-by-Prosper model seems to be reducing risk much more, if this is any indicator.

Monthly Loan Payment vs. Loan Status

Another seemingly intuitive relationship with loan statuses is that of the monthly payment amount. Surely a larger payment makes borrowers more likely to have a negative loan outcome, right?

Well, never mind about that theory! The median monthly loan payment amount for Chargedoff and Defaulted loans is approximately the same value as the amount for FinalPayment and Completed, and actually lower than that of loans that are Current! Even when looking at the distribution for each status as a function of monthly payment amount, we see peaks at roughly the same monthly payment amounts. In fact, when accounting for outliers, the highest monthly payment amounts exist for Completed loans, totally flying in the face of my hypothesis on this one. Perhaps this variable isn’t as predictive as I had hoped. Noted.

Monthly Income vs. Loan Status

It seems like a long shot at this point, but let’s see how the loan status trends as a function of the borrower’s stated monthly income. If there is a trend here, I’d expect those with higher incomes to have an easier time paying off their loans/keeping current. Is that indeed the case?

Ummm…I’m going to go out on a limb here and assume that the incomes included for Current and Completed are probably a bit erroneous, or at the very least significant outliers. Let’s scale down the y-axis a little bit.

Well, it certainly seems like the loans with positive statuses also tend to have a higher median monthly income as well as significantly higher outlier incomes than those with negative statuses. So, at the very least this is strongly suggestive that I’m right about the relationship between loan status and monthly income! Of course, things like biases in self-reporting of incomes by the borrowers can be an issue here, but that’s more likely the reason why we have some significant income outliers, not the cause of the overall trends we’re seeing.

Loan Origination Amount vs. Loan Status

And now, for our final bivariate plot, let’s look at loan origination amounts (which we’re equating here to the loan principal) as a function of loan status. As with many of these other relationships, we expect that higher origination amounts will result in loans that are harder to pay off and thus more negative statuses, but for some reason I have a feeling that this variable (like so many before it) will turn out to be more complicated than that…

The first thing I notice here are the peaks in the distributions for each status. As we observed in the univariate analysis section, people seemed to favor loans with principals that were multiples of $5,000 (ah, the idiosyncrasies of humans’ logarithmic thinking). In addition, it’s clear that my hypothesis is again proven false. We can see here that the median values of loan principals for the positive statuses are equivalent to or higher than those for Chargedoff and Defaulted. I guess good borrowers tend to avoid small personal loans in favor of one big one? Hard to say for sure. Perhaps we need to take a look at what types of loans these were? Let’s try that.

Loan Principal vs. Listing Category

The patterns here are pretty intuitive. The top 5 categories by median value are:

  1. Debt Consolidation
  2. Baby & Adoption
  3. Wedding Loans
  4. Business
  5. Boat

Nope, nothing here seems to explain the larger median principal amounts for more positive status loans. Oh well, this may be something that should be noted for future exploration and modeling.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The primary feature of interest for this data set is the loan status, as it represents the current state/outcome of each loan and the effects that loan may have had on the life of the borrower. So all of these observations are relative to the LoanStatus variable. Here are some observations I made:

  1. Borrowers’ states of residence show some interesting trends. The number of loans overall and per capita varies quite a bit across states, but surprisingly the proportions of each state’s loan portfolio that fall into the different loan statuses aren’t that different. One anomaly here though: DC seems to have an abnormally low number of loans with negative status (I’m largely looking at Chargedoff and Defaulted statuses when I reference “negative statuses”).

  2. We do indeed see what we may expect with relation to credit scores, wherein higher scores tend to have better loan statuses (as measured by the median credit score and the length of the distribution tails into the lower score region for each status).

  3. I found a concerning trend when looking at how interest rates relate to loan statuses. It looks like there may be a vicious cycle present in these data: negative loan statuses are more common for loans with higher interest rates and lower rates are more common for loans with good statuses. Since we already know that credit scores correlate negatively with interest rates, this means that people with poor credit scores get higher rates, which make them more likely to end up defaulting on the loans, which then will cause their credit scores to go down further. This is a concerning possibility that warrants further exploration in the multivariate analysis section of this report.

  4. When looking at delinquencies in the last seven years for borrowers, we find that the positive loan statuses have significantly fewer delinquencies in their uppermost quartiles than the negative loan status groups. That being said, the median delinquency count for each status grouping seems to be zero, except for Cancelled loans.

  5. Loans in the Cancelled status are very limited in time: in fact, they’re so limited it may be that they are entirely artificial (e.g. created by a software engineer to test out the code for the online marketplace). Even if that’s not the case, the way they are so limited in time makes them highly suspicious data points and justifies my ongoing assumption that they’re not an important status from which we can derive meaningful insights.

  6. Also, we find that many of the loans with positive statuses were listed after the market reboot in 2009 and many of the loans with negative statuses were created before then. This suggests that the modern interest-rate-setting model that Prosper is using has a tendency to reduce lender risk compared to the old auction-based model. This is bound to be a good thing for their business, even though they only instituted the changes due to a class action lawsuit.

  7. Monthly payment amounts don’t seem to be very predictive of loan statuses, surprisingly.

  8. Finally, it looks like borrowers with higher incomes tend to have more positive loan statuses.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Indeed I did. Some relationships between predictor variables I observed were:

  1. Borrower state vs. Listing creation date: found that a few states (IA, ME, and ND) didn’t participate in the marketplace after 2009. Also, of the states that we identified as having the highest loans per capita (DC, CT, GA, IL, MD, and OR), we saw most of their activity occurring after the 2009 Prosper marketplace reboot.

  2. Credit Scores vs. Monthly Loan Payments: these are positively correlated and show a peak in the monthly payment amount at a score of 710. This is a very odd result, given that higher scores should typically result in lower rates and thus lower monthly payments. It’s possible that this assumption is flawed however, as it’s possible that those with higher scores tend to choose shorter loan terms or higher loan principals. Likely there is indeed something going on here that goes beyond interest rates, since we also saw a negative correlation between interest rate and credit score (as we’d expect). This is something I’ve flagged for deeper analysis in the multivariate section of this report.

  3. Credit Scores vs. Interest Rates: one interesting item to note is the nonlinear decrease in interest rates with increasing scores: it’s easier to score a significant reduction in interest rate by improving a low credit score by 10 points than it is to improve a high score by the same amount. This is somewhat intuitive given the hard upper limit on credit scores that exists (850), but is still more obvious with the visualization provided herein.

  4. Credit Scores vs. Loan Listing Dates: the floor for borrower credit scores seems to have been forced higher and higher as the market aged, starting around 350-425 in the early days, then increasing to 520, 600, and 640 over time.

  5. APR and Lender Yield vs. Interest Rates: these are essentially the same variables and thus we choose to exclude APR and Lender Yield from further analysis in favor of BorrowerRate.

  6. Interest Rates vs. Monthly Payment: if the interest rate for a loan goes up, we’d expect that the monthly payment would also go up, right? Wrong. Analysis of the rate and monthly payments across the portfolio indicates a negative Pearson’s r. However, this is another case wherein we likely have nonlinear behavior at play. I say this because, at first glance (with no alpha applied to the plot), it looks like the relationship we expect between these variables is indeed the one we observe: higher rates lead to higher monthly payments. But as soon as you start playing with the visualization to account for overplotting, we see the story isn’t so simple. In particular, an oddity we note is a sort of stratification effect: there seem to be positively-sloped “strata” in the data, which suggests that layering in other loan parameters may illuminate this further. This is also something I’ve noted needs further multivariate exploration later in this report.

What was the strongest relationship you found?

Well, if the metric for “strongest relationships” is Pearson’s r, then the strongest relationships were between APR, lender yields, and interest rates!

But beyond that, the strongest, and one of the most interesting, relationships is that between loan statuses and time: A significant portion of the loans with the most negative statuses (Defaulted or Chargedoff) correspond with loans listed prior to the 2009 Prosper marketplace reboot and a substantial amount of the loans with the most positive statuses (Current, Final Payment in Progress, or Completed) are those that were listed after the 2009 reboot. As stated earlier, this seems to suggest that risk to lenders was significantly lowered due to the change in interest-setting protocols. This is particularly interesting since this change is bound to only be a positive one for Prosper’s business model, but Prosper didn’t seek out this change: they were forced to do it because of a class-action lawsuit! This is a great example of a modification to a business model that would benefit from some rigorous A/B testing.

Multivariate Plots Section

Exploring Further: Credit Scores and Monthly Loan Payments

Earlier in the Bivariate Plots section, we observed a positive correlation between credit scores and monthly loan payments. Why would your monthly payment be higher if you were considered a less risky borrower? Most likely, interest rates have little to do with it, and there is more to unpack here. We also noted a nonlinear behavior to loan payments as a function of credit score that can hopefully be explained herein.

Hmmm…this doesn’t answer our original question, but rather spawns new questions! Why do we see less spread in credit scores for the shortest and longest terms? And why does it still seem like monthly payments go up as a function of improving credit scores? We can see that, likely, longer terms result in lower monthly payments, but it feels like we’re missing important information here. Perhaps we can add in one more visual dimension to really provide some insights…

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Here we provide 4 dimensions of information: the monthly loan payment (y-axis), the borrowers’ credit scores (x-axis), the amount of principal of each loan (size of data points, ranging from $1,000 to $35,000), and the length/term of the loan (facets). We can infer a few things from this view.

First of all, it seems like different loan terms have different minimum allowed credit scores. The 36 month term, which we know from our univariate analysis to be the majority of the loan portfolio (75%), includes scores as low as 430 or less. But the other two term lengths don’t seem to allow for credit scores below 600. Looking back at our analysis of credit scores as a function of loan listing date, we recall that all of the loans were limited to a minimum borrower credit score of 600 after 2009. Likely the 12- and 60-month terms were loan products that didn’t come into existence until after 2009.

Without fail, larger principal amounts equate to higher monthly payments (which is, admittedly, intuitive). Also, it looks like higher credit scores provided borrowers with access to higher allowed principal amounts, as there seem to be discrete steps wherein a higher credit score gains access to a higher principal and a higher monthly payment. It is therefore pretty likely that monthly payment amounts are far more sensitive to the principal amount and loan term than they are to the credit score.
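The sensitivity claim can be checked against the standard fixed-rate amortization formula. The sketch below assumes Prosper loans are fully amortizing with equal monthly payments, which I haven’t verified from the data; the function name and the example figures are illustrative only.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Monthly payment for a fully amortizing fixed-rate loan (assumed structure).

    annual_rate is a decimal (0.15 means 15% per year), compounded monthly.
    """
    r = annual_rate / 12.0
    if r == 0:
        return principal / term_months
    return principal * r / (1.0 - (1.0 + r) ** -term_months)


# Illustrative comparison around a made-up baseline loan.
base = monthly_payment(10_000, 0.15, 36)
double_principal = monthly_payment(20_000, 0.15, 36)  # same rate and term
longer_term = monthly_payment(10_000, 0.15, 60)       # same principal and rate
higher_rate = monthly_payment(10_000, 0.25, 36)       # same principal and term
```

Under these assumptions, doubling the principal doubles the payment exactly, while raising the rate by ten percentage points moves the payment far less. That is consistent with credit score (acting through the rate) being the weaker lever.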

Exploring Further: Monthly Loan Payments and Borrower Rates

Given that the interest rate seems to correlate moderately (r = -0.462) with the credit score, we’d expect that monthly payments would exhibit a similar relationship to interest rates as they did previously with credit scores (albeit with a reversal of the sign of the correlation, given that higher credit scores should produce lower interest rates). Let’s take a closer look at that relationship, especially since earlier in the bivariate section we saw a curious relationship between monthly payments and interest rates: they were negatively correlated, and we saw some kind of stratification effect wherein different positively-sloped lines were visible in the data set, potentially indicative of something fishy going on.

First I reproduce the earlier odd-looking plot of Monthly Payment vs. Rate for comparison, then I facet by Listing Category to see if, perhaps, there is some obvious deviation among the categories that can explain what’s going on here. Alas, nothing obvious pops out. In fact, the only real item of note is that the Not Available category is the only one exhibiting the positively-sloped overall behavior that I was originally expecting.

I wonder if this is another example of loan parameters being artificially limited (e.g. as we saw in the preceding sub-section, higher credit scores were the only ones with access to higher principal amounts, explaining why those high credit score individuals are appearing to have much higher monthly payments)?

Interesting! It looks like that stratification effect that I saw earlier is potentially the result of different levels of loan principal (each positively-sloped line we can see by eye is a different iso-principal set of loans). To explore this a little more, we need to bin these principal values discretely so we can then facet the results.

It looks pretty clear at this point that, indeed, the relationship between monthly payments and interest rates is as we expected, once one accounts for differing levels of principal: higher rates equate to higher monthly payments.
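Here is a minimal Python sketch of that reversal on synthetic data: a negative pooled Pearson’s r that turns positive once loans are binned by principal. The loans, the bin edges, the 36-month term, and the amortization formula are all assumptions chosen to mimic the pattern described above; this is not Prosper data or the code behind the plots.

```python
from bisect import bisect_left
from math import sqrt


def monthly_payment(principal, annual_rate, term_months=36):
    """Fixed-rate amortization payment (assumed loan structure, rate > 0)."""
    r = annual_rate / 12.0
    return principal * r / (1.0 - (1.0 + r) ** -term_months)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical bin edges for principal, read as (lo, hi] intervals.
EDGES = [0, 4000, 8000, 15000, 35000]


def principal_bin(principal):
    """Index of the (lo, hi] interval that contains `principal`."""
    return bisect_left(EDGES, principal) - 1


# Synthetic loans: larger principals are paired with lower rates, mimicking
# higher-credit-score borrowers accessing bigger loans.
loans = []
for principal, lowest_rate in [(2000, 0.25), (5000, 0.18), (10000, 0.12), (20000, 0.06)]:
    for step in range(6):
        rate = lowest_rate + 0.02 * step
        loans.append((principal, rate, monthly_payment(principal, rate)))

# Correlation of rate and payment over all loans at once.
pooled_r = pearson_r([loan[1] for loan in loans], [loan[2] for loan in loans])

# The same correlation computed separately inside each principal bin.
rows_by_bin = {}
for principal, rate, payment in loans:
    rows_by_bin.setdefault(principal_bin(principal), []).append((rate, payment))
binned_r = {
    b: pearson_r([rate for rate, _ in rows], [pay for _, pay in rows])
    for b, rows in sorted(rows_by_bin.items())
}
```

In this toy setup `pooled_r` comes out negative while every entry of `binned_r` is positive, which is the same qualitative picture as the faceted plots.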

Exploring Further: Borrower Rates and Loan Statuses

Earlier, when looking at borrower interest rates versus loan statuses, we noted that higher-interest loans seemed to tend towards poorer loan statuses, setting up a potentially vicious cycle: those with poor credit scores likely get higher interest rates, then likely end up with poor loan statuses, which decrease their credit score further.

Let’s dive deeper and explore the possibility of this further. First, we’ll investigate these data by adding in the variable of credit score. If what we expect to see is proven accurate, we’ll then take a deeper look at different facets of the positive and poor loan status groups, seeing if there are patterns we can discern between these populations and, hopefully, see signs that a low credit score borrower has a reasonable chance of having a positive loan outcome.

First of all, a note on this visualization: the area of each violin plot is scaled according to the actual counts represented by that plot in a given range of credit scores. So, fatter plots have a higher count of loans that they are representing, but only for that credit score bin (so the Current plot in (650,700] with the same area as the Current plot in (700,750] may not have the same counts associated with it). Also, each facet represents a range of credit scores.

Hmmm…I think these plots give us some reason for hope. While we can definitely see the correlation between high credit scores and low interest rates (and vice versa) through the median interest rate values and the bulges in the violin plots (representing the distribution of loans in that facet and loan status over interest rates), we also see that a fair proportion of the lowest credit score borrowers are still making it into the positive statuses (located to the right of the blue solid line). This is good news! It means that you can still have a reasonable chance for success with your Prosper loan, even if you find that you have a high interest rate relative to someone with the exact same loan but a higher starting credit score.
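The “fair proportion” reading of the violin plots could be backed up with a simple tabulation. The Python sketch below bins credit scores into half-open intervals like the facet labels and computes the share of positive statuses per bin. The bin width of 50, the status spellings, and the toy rows are assumptions for illustration, not values from the data set.

```python
from collections import defaultdict

BIN_WIDTH = 50  # assumed facet width, matching labels such as (650,700]
POSITIVE = {"Current", "Final Payment in Progress", "Completed"}


def score_bin(score, width=BIN_WIDTH):
    """Label of the (lo, hi] bin containing an integer credit score."""
    hi = -(-score // width) * width  # ceiling to the next multiple of width
    return f"({hi - width},{hi}]"


def positive_share_by_bin(loans):
    """Map each credit score bin to its share of positive-status loans.

    `loans` is an iterable of (credit_score, status) pairs.
    """
    tallies = defaultdict(lambda: [0, 0])  # bin label -> [positive, total]
    for score, status in loans:
        tally = tallies[score_bin(score)]
        tally[1] += 1
        if status in POSITIVE:
            tally[0] += 1
    return {label: pos / total for label, (pos, total) in tallies.items()}


# Made-up rows, only to exercise the functions.
toy_loans = [
    (520, "Chargedoff"), (540, "Defaulted"), (545, "Completed"),
    (660, "Completed"), (690, "Current"), (700, "Chargedoff"),
    (720, "Completed"), (750, "Current"),
]
share = positive_share_by_bin(toy_loans)
```

Note that a score sitting exactly on a boundary, such as 700, falls in the lower bin, `(650,700]`, under this convention.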

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Yes! We were concerned in the bivariate section of the report that borrowers with poor credit scores were often getting high interest rates and that this was setting them up for a highly probable negative loan outcome/status, which of course would result in a lowering of their credit scores, suggesting a potential “vicious cycle of failure” that a borrower could never escape once caught up in it. Thankfully, once we faceted interest rate versus loan status by credit score bins, we found that a fair number of low-credit-score borrowers were actually still making it into positive loan statuses even with high interest rates. In fact, the positive loan statuses seemed to be the majority of the loans, except for the very lowest credit score bin. While there may still be cause for concern for that lowest score group, overall this is much better news than our original bivariate analysis had suggested!

Were there any interesting or surprising interactions between features?

We finally figured out the oddities related to the seemingly positive correlation between monthly payments and borrower credit scores! By adding in the extra dimensions of loan terms and loan principal amounts, we discovered that monthly payments were more sensitive to loan term lengths and principal than to credit scores themselves. However, we also discovered that borrowers seem to access higher principal amounts with higher credit scores. This is most likely a feature built into the Prosper platform: an artificial limit on the principal amounts allowed to certain credit score brackets. This limit effectively caused the counterintuitive positive correlation between credit score and monthly payment, since apparently a large number of borrowers with high credit scores decided to take advantage of the increased principal ceiling they were offered.

In addition, we explored the relationship between monthly loan payments and the interest rate. This, too, had a strange correlation when treated in a bivariate manner, showing a decrease in monthly payment as interest rate increased. The eye was also immediately drawn to what looked like strata in the data: positively-sloped lines, all layered upon one another. By both faceting and coloring by the loan principal variable, we were able to determine that these lines were likely constant-principal (or near-constant) lines, showing an increasing monthly payment with increasing interest rate, just as we were originally expecting. In this case, it looked like nothing terribly artificial was occurring due to the market’s structure; it was just a matter of too many data points and not enough context.


Final Plots and Summary

Plot One

Description One

Here we show two plots: the number of loans in our data set for each borrower state, and then the same plot with a twist: we’ve taken the population of each state and scaled the original counts by \(1/Population\) to give us a count of the loans per capita in each state. Vertical red bars indicate the top 6 states for loans per capita, and these same states are marked in the raw counts plot, highlighting how drastically different the count of loans per state is when you don’t account for population.

The upper plot doesn’t tell us a whole lot, really, beyond telling us in which states Prosper sees the greatest traffic. The lower plot is the more interesting one: it tells us which states show the most interest in the personal loans offered by Prosper, rather than conflating that interest with simple differences in population between states.
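The per-capita scaling is simple enough to sketch in a few lines of Python. The state codes, loan counts, and populations below are placeholders invented for illustration (not real states or Prosper figures), and the per-100,000 scaling is just an assumed convenience for readability.

```python
def loans_per_capita(loan_counts, populations, per=100_000):
    """Loans per `per` residents for every state present in both inputs."""
    return {
        state: loan_counts[state] * per / populations[state]
        for state in loan_counts
        if state in populations
    }


def top_states(values, n):
    """The n states with the largest values, largest first."""
    return sorted(values, key=values.get, reverse=True)[:n]


# Placeholder inputs: a big state with many loans, a small state with few.
toy_counts = {"AA": 10_000, "BB": 500, "CC": 3_000}
toy_populations = {"AA": 40_000_000, "BB": 1_000_000, "CC": 10_000_000}

per_capita = loans_per_capita(toy_counts, toy_populations)
by_raw_count = top_states(toy_counts, 3)
by_per_capita = top_states(per_capita, 3)
```

With these placeholder numbers the largest state leads on raw counts but ranks last per capita, which is the kind of reordering the two plots are meant to expose.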

Plot Two

Description Two

As discussed when we last visited this plot in the multivariate plots section, here we can see how different loans’ statuses differ based upon interest rate and borrower credit score. We have colored negative statuses with a red background and positive ones with a green background to provide a guide for the eye.

We find that, while borrowers with lower credit scores do tend to have higher interest rates for their loans, there is still a good proportion of each credit score group that has positive statuses, save for the very lowest-scoring group. This is evidence that, while your chances of having a happy ending with your loan are diminished with a lower credit score and thus a higher likely interest rate, you are by no means doomed to have a bad loan outcome (and a further diminished credit score as a result).

Plot Three

Description Three

These plots show us that loans with the middle term (36 months) have the widest range of credit scores. Looked at in this way, we realize that credit scores likely dictate the size of principal that Prosper allowed a borrower to borrow, which directly impacted their monthly payment. In addition, we see based on the coloring that the 36-month term has the largest variance in loan creation dates, which is unsurprising given our earlier analysis of borrower credit scores as a function of the loan creation date. One final interesting item is that the average size of a loan seems to go up with increasing term length. As a longer term reduces the monthly payment amount, this isn’t a terribly surprising result, as I would expect most people would want to amortize their loan over the longest time possible to minimize each month’s payment, even though that means they’re spending more on interest over the life of the loan.

I’d also like to point out here how cool it is that this plot covers 5 dimensions:

  1. Monthly payment amount
  2. Credit score
  3. Loan term
  4. Loan principal amount
  5. Loan creation date

Hopefully you’ll agree that it does so in an intuitive manner!


Reflection

This loans data set has almost 114,000 loans in it and the full set of data includes 81 different variables, with the loans all coming from Prosper.com and covering the time period of 2006 - 2014. That’s a lot of loans and variables to investigate! I chose to condense my investigation down to 16 variables that I thought had the potential to be the most interesting, both in terms of my chosen outcome variable LoanStatus and in relation to one another.

I found trends that I both expected and that surprised me. For example, I expected to find a positive correlation between interest rates and monthly loan payments. While it took some time and deeper exploration than I expected, this expectation proved valid in the end. However, I was surprised to find that there were some externalities present in the data that required a modification of some of my hypotheses. For example, the 2009 reboot of the Prosper marketplace, wherein they significantly changed their mechanism for setting the interest rate of a listing, had deep ramifications on the data. I discovered that, not only did activity in the platform grind to a halt in early 2009 and then slowly start from scratch again midway through that year, but apparently limitations of different kinds were applied before and after that point in time (e.g. credit score floors appear to have been implemented that year and in later years). These revelations caused me to treat the time series variables (ListingCreationDate and ClosedDate) with far more caution than I would ever have expected. In addition, a particularly useful trend I observed (especially useful to Prosper’s business practices, I’m sure) was that loans with positive statuses were far more common after the market reboot than beforehand. This suggests that the pre-2009 auction-style rate-setting scheme Prosper initially implemented generated much higher-risk loans than the new format, likely resulting in far more pleased lenders on the platform.

The original data contain many variables that I didn’t explore, and those are a ripe area for future exploration (e.g. a number of variables from borrowers’ credit reports were omitted here). Additionally, had I not needed to spend so much time on pure exploration of this large data set, it would have been interesting to build a model to predict individual borrower risk of default. There are so many variables at play in this data set that a simple linear regression approach would likely not capture the necessary nuances, but it would certainly be interesting to try some kind of out-of-sample testing and prediction. Also, since these data stop in early 2014 and it is now mid-2018, it’s possible that more information could be gleaned from a data set update. I could imagine asking questions like “10 years out from the US market collapse, do we see signs of accelerating or decelerating economic recovery, using Prosper activity as a proxy?”